Load packages

1 - ex.1 in ISLR chapter 2

1

  1. Since the number of observations (n) is extremely large and the number of predictors (p) is relatively small, I prefer a flexible method: with n/p >> 20, there is no serious shortage of degrees of freedom.

  2. Since the number of observations (n) is small and the number of predictors (p) is extremely large, I prefer an inflexible method: with n/p << 20, a flexible fit would run into a serious degrees-of-freedom problem.

  3. If the relationship between the predictors and the response is highly non-linear, a flexible method is preferable, because an inflexible (e.g., linear) method would suffer from high bias. Non-linear methods such as KNN are generally more flexible than linear regression.

  4. If the variance of the error terms is extremely high, an inflexible method is preferable: a flexible method would chase the noise in the training data and overfit, increasing the test MSE.

3

[Figure: sketch of training MSE, test MSE, squared bias, variance, and irreducible error versus model flexibility]

The training MSE decreases monotonically: as flexibility increases, the model fits the training data better and better. The test MSE first decreases and then increases: performance on test data improves as the method becomes more complex, but once the method starts overfitting the training data, test performance degrades and the test MSE rises. Var(ε) is the irreducible error, so it stays constant in the plot. The squared bias decreases monotonically, because a more complex method can come closer to the true underlying relationship. The variance increases monotonically, because a model that fits one training set very closely will change substantially when fit to a different training set.
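These curve shapes follow from the standard decomposition of the expected test MSE at a point x0 (a general identity, not specific to this exercise):

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\!\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\!\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

Flexibility drives the variance term up and the squared-bias term down, while Var(ε) bounds the test MSE from below.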

8

college <- read.csv("College.csv")
rownames(college) <- college[, 1]  # use the first column (college names) as row names
college <- college[, -1]           # drop the now-redundant first column
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
pairs(college[,1:10])

plot(Outstate ~ Private, data = college)

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
##  No  Yes 
## 699   78
plot(Outstate ~ Elite, data = college)

par(mfrow=c(2,2))
hist(college$Top10perc, breaks = 10, xlab = "Top10perc", main="histogram of Top10perc with break=10")
hist(college$Top10perc, breaks = 5, xlab = "Top10perc",
     main="histogram of Top10perc with break=5")
hist(college$Top25perc, breaks = 10, xlab = "Top25perc", main="histogram of Top25perc with break=10")
hist(college$Top25perc, breaks = 5, xlab = "Top25perc", main="histogram of Top25perc with break=5")

accept_rate <- college$Accept / college$Apps
pvalue <- t.test(accept_rate[college$Private == "Yes"],
                 accept_rate[college$Private == "No"])$p.value
pvalue <- round(pvalue, digits = 4)
ggplot(data=college,aes(x=Private, y=Accept/Apps))+
  geom_boxplot(col="black",alpha=0.8, fill=c("yellow","purple")) +
  geom_quasirandom(dodge.width=0.9,alpha=.4) +
  theme_bw()+
  geom_signif(comparisons = list(c("No","Yes")), 
              map_signif_level=TRUE,test = "t.test") +
  labs(y="percentage of acceptance") +
  scale_x_discrete(    breaks=c("No", "Yes"),
                       labels=c("Public University", "Private University")) +
  annotate(geom="text",x=1.5, y=1.1, label=paste("p.value (student t test):",pvalue),
           color="black",size=3) +
  labs(title = "Boxplot of Accept Percentage",
        x = "",
       y = "Percentage of Acceptance") +
  theme(axis.text.x=element_text(size=13),
        axis.text.y=element_text(size=13),
        axis.title.x=element_text(size=15),
        axis.title.y=element_text(size=15),
        legend.text= element_text(size=13),
        legend.title = element_text(size=10),
        title = element_text(size=20))

The plot shows a statistically significant difference in acceptance rate between public and private universities. Interestingly, although the mean acceptance rate of private universities is higher than that of public ones, some private universities have extremely low acceptance rates. Also, many more students apply to private universities than to public ones.

2 - ex.9 in ISLR chapter 3

(a)
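Part (a) asks for a scatterplot matrix. A minimal sketch, assuming the Auto data frame from the ISLR package is loaded and excluding the qualitative name column as in part (b):

```r
# scatterplot matrix of all quantitative variables in Auto
pairs(Auto[, -which(names(Auto) %in% "name")])
```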

(b)

cor(Auto[ , -which(names(Auto) %in% "name")])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c)

summary(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto))
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

According to the p-values of the linear regression, displacement, weight, year, and origin have a statistically significant effect on the response, mpg, while cylinders, horsepower, and acceleration do not.

(d)

plot(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto))

Based on the residual plot, the residuals are scattered around 0 and show no obvious pattern.

According to the normal Q-Q plot, the linear regression model fits the data well except for a few points, such as observations 32 and 323.

Observations 327 and 394 are outliers, while observation 14 is a high-leverage point in this model.
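These labels come from reading the diagnostic plots; they can also be checked numerically. A minimal sketch, refitting the same model as above:

```r
# refit the model from parts (c)/(d)
fit <- lm(mpg ~ cylinders + displacement + horsepower +
          weight + acceleration + year + origin, data = Auto)
# largest absolute studentized residuals flag candidate outliers
head(sort(abs(rstudent(fit)), decreasing = TRUE), 3)
# largest hat values flag candidate high-leverage points
head(sort(hatvalues(fit), decreasing = TRUE), 3)
```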

(e)

fit <- lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto)
fit1 <- lm(mpg ~ cylinders * displacement + horsepower +
           weight + acceleration + year + origin, data = Auto)
anova(fit, fit1)
fit2 <- lm(mpg ~ cylinders + displacement * horsepower +
           weight + acceleration + year + origin, data = Auto)
anova(fit, fit2)
fit3 <- lm(mpg ~ cylinders + displacement + horsepower *
           weight + acceleration + year + origin, data = Auto)
anova(fit, fit3)
fit4 <- lm(mpg ~ cylinders + displacement + horsepower +
           weight * acceleration + year + origin, data = Auto)
anova(fit, fit4)
fit5 <- lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration * year + origin, data = Auto)
anova(fit, fit5)
fit6 <- lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year * origin, data = Auto)
anova(fit, fit6)

According to the ANOVA comparisons, each of the pairwise interactions tested above (among the Auto variables other than mpg and name) significantly improves the fit of the linear regression model.

(f)

car::boxCox(Auto$mpg ~ Auto$cylinders + Auto$displacement + Auto$horsepower + Auto$weight + Auto$acceleration + Auto$year + Auto$origin)

car::powerTransform(Auto$mpg ~ Auto$cylinders + Auto$displacement + Auto$horsepower + Auto$weight + Auto$acceleration + Auto$year + Auto$origin)
## Estimated transformation parameter 
##         Y1 
## -0.3561452

The Box-Cox profile peaks at about lambda = -0.36, which is close to lambda = -0.5, corresponding to the transformation 1/sqrt(y).

summary(lm(1/sqrt(mpg) ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto))
## 
## Call:
## lm(formula = 1/sqrt(mpg) ~ cylinders + displacement + horsepower + 
##     weight + acceleration + year + origin, data = Auto)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032317 -0.007224 -0.000136  0.006959  0.046613 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.376e-01  1.734e-02  19.472  < 2e-16 ***
## cylinders     3.249e-03  1.207e-03   2.692  0.00742 ** 
## displacement -6.110e-05  2.806e-05  -2.178  0.03003 *  
## horsepower    2.154e-04  5.147e-05   4.184 3.55e-05 ***
## weight        2.606e-05  2.434e-06  10.705  < 2e-16 ***
## acceleration  4.418e-04  3.690e-04   1.197  0.23191    
## year         -3.024e-03  1.903e-04 -15.890  < 2e-16 ***
## origin       -3.297e-03  1.038e-03  -3.175  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01242 on 384 degrees of freedom
## Multiple R-squared:  0.8896, Adjusted R-squared:  0.8876 
## F-statistic: 442.2 on 7 and 384 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(lm(1/sqrt(mpg) ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 1)
plot(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 1)

plot(lm(1/sqrt(mpg) ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 2)
plot(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 2)

plot(lm(1/sqrt(mpg) ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 3)
plot(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 3)

plot(lm(1/sqrt(mpg) ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 4)
plot(lm(mpg ~ cylinders + displacement + horsepower +
           weight + acceleration + year + origin, data = Auto),which = 4)

After the transformation, the diagnostic plots improve noticeably: the residuals show less curvature and the Q-Q plot lies closer to the diagonal.

3 - Final prediction model

testing dataset

What I learnt from others' solutions:

  1. Real data sets contain a lot of missing data. Before the analysis, it is better to do an exploratory data analysis and examine the pattern of missingness; we can then decide whether to remove or impute. Duplicate observations should also be removed.

  2. According to Couronné et al. (BMC Bioinformatics), random forests tend to be slightly more accurate than logistic regression on classification problems. Before fitting a random forest classifier, categorical features should be coded as factors.

  3. Combining two variables into one can reduce the flexibility of the model.
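A minimal sketch of point 2, assuming the randomForest package is installed and using the cleaned, factor-coded train data frame constructed later in this document:

```r
library(randomForest)
set.seed(471)
# random forests require categorical predictors to be stored as factors
rf <- randomForest(Survived ~ Pclass + Sex + Age + Fare + Embarked,
                   data = train, ntree = 500)
rf  # prints the out-of-bag error estimate, comparable to the logistic model's accuracy
```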

Data dimensions and structure

training <- titanic_train
testing <- titanic_test
titanic_all <- bind_rows(training,testing)
dim(training)
## [1] 891  12
dim(testing)
## [1] 418  11
dim(titanic_all)
## [1] 1309   12
titanic_all$Survived <- as.factor(titanic_all$Survived)
titanic_all$Pclass <- as.factor(titanic_all$Pclass)
titanic_all$Sex <- as.factor(titanic_all$Sex)
titanic_all$Cabin <- as.factor(titanic_all$Cabin)
titanic_all$Embarked <- as.factor(titanic_all$Embarked)

Clean data

ifelse(length(unique(titanic_all[,1])) == nrow(titanic_all),"No duplicates","Duplicates detected!")
## [1] "No duplicates"

Impute missing data

sum(is.na(titanic_all))
## [1] 682
# replace empty strings with NA across all features
for (i in 1:ncol(titanic_all)){
  titanic_all[,i][ titanic_all[,i]== ""] <- NA
}

# count the NAs in each feature, keeping only features with at least one NA
getNA <- function(dt){
        NAs <- colSums(is.na(dt))
        NAs[NAs > 0]
}

getNA(titanic_all)
## Survived      Age     Fare    Cabin Embarked 
##      418      263        1     1014        2

Survived is the outcome in the training data, so I exclude it from the imputation step. Cabin is missing for 1014 of 1309 observations (over three-quarters), so it cannot be a good predictor. Age and Embarked can be imputed from the other variables.

Imputing Embarkation missing values

Common sense suggests that Fare, Pclass, and Embarked are related, so a plot of fare by port and class can suggest plausible values for the missing Embarked entries.

titanic_all[is.na(titanic_all$Embarked), ]  # the two passengers with missing Embarked
FareClassComp <- titanic_all %>% filter(!is.na(Embarked))
FareClassComp %>% 
        ggplot(aes(x = Embarked, y = Fare, fill = Pclass))+
        geom_boxplot()+
        geom_hline(aes(yintercept = 80),
                   colour = "red", linetype = "dashed", lwd = 2)+
        scale_y_continuous(labels = dollar_format())+
        theme_few()

The box plot shows that first-class passengers who embarked at C have a median fare of about $80, which matches the fare paid by the two passengers with missing embarkation values. So it is reasonable to assume both passengers embarked at C.

titanic_all[is.na(titanic_all$Fare), ]  # the one passenger with missing Fare
titanic_all$Embarked[is.na(titanic_all$Embarked)] <- "C"

Now only one Fare value is missing. Since that passenger travelled third class and embarked at S, we can impute the value with the median fare of third-class passengers who embarked at S.

titanic_all$Fare[titanic_all$PassengerId == 1044] <-  median(titanic_all$Fare[titanic_all$Pclass == 3 & titanic_all$Embarked == "S"], na.rm = T)

Imputing age

Typically, older passengers have more reasons (e.g., health) and more money to buy an expensive ticket in a better class or cabin, so I use Fare to impute Age.

set.seed(471)
titanic_all <- titanic_all %>%
  impute_rlm(Age ~ Fare)  # robust linear imputation from the simputation package

Create a new feature

I learnt from others that combining two variables into one can reduce the flexibility of the model, so I generate a new feature, family size (Famsize), from SibSp and Parch.

titanic_all$Famsize <- titanic_all$SibSp + titanic_all$Parch + 1

Exploratory data analysis

Codebook of the variables:

  PassengerId  Passenger ID
  Survived     Passenger survival indicator
  Pclass       Passenger class
  Name         Name
  Sex          Sex
  Age          Age
  SibSp        Number of siblings/spouses aboard
  Parch        Number of parents/children aboard
  Ticket       Ticket number
  Fare         Passenger fare
  Cabin        Cabin
  Embarked     Port of embarkation
  Famsize      Family size of passenger

Separate data set into training and testing data sets.

train <- titanic_all[!is.na(titanic_all$Survived),]
test <- titanic_all[is.na(titanic_all$Survived),]

Bar charts and box plots of the training data set.

train %>% 
  ggplot(aes(x = Survived, fill = Survived)) +
        geom_bar() +
        #labs(x = "Survival in the Titanic tragedy")+
        geom_label(stat = 'count', aes(label = ..count..)) +
        labs(fill = "Survival (0 = died, 1 = survived)")

ggplot(train, aes(x = Survived, fill = Pclass)) +
  geom_bar() +
        labs(x = "Survival vs Class")

ggplot(train, aes(x = Survived, fill = Sex)) +
  geom_bar() +
        labs(x = "Survival vs Sex")

ggplot(train,aes(x= Survived, y = Age))+
  geom_boxplot()+
        labs(x = "Survival vs Age Stage")

ggplot(train, aes(x = Survived, fill = Embarked)) +
  geom_bar() +
        labs(x = "Survival vs Embarkment Port")

ggplot(train,aes(x= Survived, y = Fare))+
  geom_boxplot()+
        labs(x = "Survival vs Fare")

ggplot(train,aes(x= Survived, y = Famsize))+
  geom_boxplot()+
        labs(x = "Survival vs Family Size")

ggplot(train, aes(x = Famsize, fill = Survived)) +
        geom_bar(stat='count', position='dodge') +
        scale_x_continuous(breaks=c(1:11)) +
        labs(x = 'Survival vs Family Size')

Predicting survival in testing data

Build model on training data set.

gfit <- glm(Survived ~ Pclass + Sex + Age + Famsize +
            Fare + Embarked, data = train,
            family = binomial(link = "logit"))

gfit$rule.5 <- ifelse(gfit$fitted.values >= 0.5,"Predict Alive", "Predict Died")
table(gfit$rule.5,gfit$y)
##                
##                   0   1
##   Predict Alive  73 240
##   Predict Died  476 102
prob <- predict(gfit, train, type="response")
pred <- prediction(prob, train$Survived)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure="auc")
auc <- round(auc@y.values[[1]],3)
roc.data <- data.frame(fpr = unlist(perf@x.values),
                       tpr = unlist(perf@y.values),
                       model = "GLM")
ggplot(roc.data, aes(x = fpr, ymin = 0, ymax = tpr)) +
  geom_ribbon(alpha = 0.2, fill = "blue") +
  geom_line(aes(y = tpr), col = "blue") +
  geom_abline(intercept = 0, slope = 1, lty = "dashed") +
  labs(title = paste0("ROC Curve w/ AUC=", auc)) +
  theme_bw()

The area under the ROC curve is about 0.85, much better than random guessing. From the confusion matrix above, the accuracy is about 80%, precision about 77%, recall about 70%, and specificity about 87%.
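These classification metrics can be recomputed directly from the confusion matrix printed earlier, treating "Predict Alive" as the positive class:

```r
# confusion matrix copied from the table() output above
tab <- matrix(c(73, 240,
                476, 102), nrow = 2, byrow = TRUE,
              dimnames = list(c("Predict Alive", "Predict Died"), c("0", "1")))
TP <- tab["Predict Alive", "1"]; FP <- tab["Predict Alive", "0"]
FN <- tab["Predict Died",  "1"]; TN <- tab["Predict Died",  "0"]
accuracy    <- (TP + TN) / sum(tab)  # ~0.804
precision   <- TP / (TP + FP)        # ~0.767
recall      <- TP / (TP + FN)        # ~0.702
specificity <- TN / (TN + FP)        # ~0.867
round(c(accuracy, precision, recall, specificity), 3)
```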

Thus, I will use this model to predict the survival of testing data.

Survival_prob <- predict(gfit, newdata = test, type = "response")  # predicted probabilities
Survival_test <- cbind(test$PassengerId, Survival_prob)
colnames(Survival_test) <- c("PassengerID", "Survived")
Survival_test[, 2] <- ifelse(Survival_test[, 2] >= 0.5, 1, 0)
write.table(Survival_test,"Prediction_of_test_data_survival.txt", quote = F, sep ="\t", col.names = T, row.names = F)